"Draw My Topics": Find Desired Topics fast from large scale of Corpus

Authors

  • Jason Dou
  • Ni Sun
  • Xiaojun Zou
Abstract

We develop the “Draw My Topics” toolkit, which provides a fast way to incorporate social scientists’ concerns and interests into the standard topic model. Instead of feeding the topic model a raw corpus with only primitive preprocessing, the toolkit uses an algorithm based on the Vector Space Model and conditional entropy to connect social scientists’ subjective interests with the output of unsupervised topic models. The algorithm leaves room for users to make adjustments for the specific corpus they are interested in. We demonstrate the toolkit on the Diachronic People’s Daily Corpus in Chinese. Several “central words” that may interest social scientists from different disciplines, such as “Enlai Zhou” (the first premier of the PRC) and “Cultural Revolution”, are supplied to the toolkit together with the original corpus, and the most related topics are then presented efficiently for further research.
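The abstract does not spell out the algorithm, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It assumes documents are already tokenized, uses raw term counts as the vector space representation, and treats conditional entropy as H(C | W), where C indicates that a document contains the central word and W that it contains a candidate word. The function names `cosine_similarity`, `select_documents`, and `conditional_entropy`, the threshold value, and the toy corpus are all hypothetical and are not taken from the toolkit.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_documents(docs, central_words, threshold=0.1):
    """Keep documents whose count vector is close to the central-word query.

    The threshold is an illustrative knob a user could tune per corpus.
    """
    query = Counter(central_words)
    return [d for d in docs if cosine_similarity(Counter(d), query) >= threshold]


def conditional_entropy(docs, central, candidate):
    """H(C | W) in bits, estimated from document-level co-occurrence.

    C: the document contains `central`; W: it contains `candidate`.
    Lower values mean the candidate word tells us more about whether
    the central word is present.
    """
    n = len(docs)
    joint = Counter((candidate in d, central in d) for d in docs)
    h = 0.0
    for (w, _c), count in joint.items():
        p_wc = count / n
        p_w = (joint[(w, True)] + joint[(w, False)]) / n
        h -= p_wc * math.log2(p_wc / p_w)
    return h


# Toy tokenized corpus (hypothetical data, for illustration only).
toy_docs = [
    ["zhou", "premier"],
    ["zhou", "premier"],
    ["harvest"],
    ["harvest", "steel"],
]
selected = select_documents(toy_docs, ["zhou"])
```

In this sketch, the documents that survive `select_documents` would be the input to a standard topic model, and candidate words with low `conditional_entropy` relative to a central word could be used to rank or filter the resulting topics. How the actual toolkit combines the two steps is not stated in the abstract.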

Similar resources

Sampled Weighted Min-Hashing for Large-Scale Topic Mining

We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term cooccurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SW...

Stylex: a corpus of educational videos for research on speaking styles and their impact on engagement and learning

In the context of learning through educational videos, the material chosen for a given topic must not only be relevant but also engaging to the consumer—ensuring better understanding and retention of content. This paper focuses on the speaking style of instructors, which is an important aspect driving student engagement. We present StyleX, a corpus of 450 1-minute video clips featuring 50 instr...

Selected topics from 40 years of research on speech and speaker recognition

This paper summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, an...

Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Topic modeling is an increasingly important component of Big Data analytics, enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM), while mathematically elegant, do not lend themselves well to direct parallelization because of dependencies from one time step to another. Data decomposition approaches that partition ...

Topic Detection, Ranking and Modeling Evolution in Bibliographic Datasets

Topic detection in a text corpus is the detection of semantic units from the underlying texts that can function as building blocks of knowledge discovery. Topic detection provides a powerful tool for text summarization and information navigation across a corpus of documents. Topic detection from text documents using statistical models and natural language processing techniques has been extensiv...

Journal:
  • CoRR

Volume abs/1602.01428

Publication date 2014